System and method for finding information in a distributed information system using query learning and meta search

ABSTRACT

An information retrieval system finds information in a Distributed Information System (DIS), e.g., the Internet, using query learning and meta search for adding documents to resource directories contained in the DIS. A selection means generates training data characterized as positive and negative examples of a particular class of data residing in the DIS. A learning means generates from the training data at least one query that can be submitted to any one of a plurality of search engines for searching the DIS to find “new” items of the particular class. An evaluation means determines and verifies that the new item(s) is a new subset of the particular class and adds or updates the particular class in the resource directory.

RELATED APPLICATION

Provisional Application, Ser. No. 60/015,231, filed Apr. 10, 1996 and assigned to the same assignee as that of the present invention.

APPENDIX ON CD-ROM

Appendix 3 to this Specification is a computer program listing appendix, submitted on a CD and incorporated by reference in its entirety.

Notice

This document discloses source code for implementing the invention. No license is granted directly, indirectly or by implication to the source code for any purpose by disclosure in this document except copying for informational purposes only or as authorized in writing by the assignee under suitable terms and conditions.

BACKGROUND OF THE INVENTION

(1) Field of the Invention

This invention relates to information retrieval systems. More particularly, the invention relates to information retrieval in a distributed information system, e.g., the Internet, using query learning and meta search.

(2) Description of the Prior Art

The World Wide Web (WWW) is currently filled with documents that collect together links to all known documents on a topic; henceforth, we will refer to documents of this sort as resource directories. While resource directories are often valuable, they can be difficult to create and maintain. Maintenance is especially problematic because the rapid growth in on-line documents makes it difficult to keep a resource directory up-to-date.

This invention describes machine learning methods that address the resource directory maintenance problem. In particular, we propose to treat a resource directory as an extensional definition of an unknown concept, i.e., documents pointed to by the resource list will be considered positive examples of the unknown concept, and all other documents will be considered negative examples of the concept. Machine learning methods can then be used to construct from these examples an intensional definition of the concept. If an appropriate learning method is used, this definition can be translated into a query for a WWW search engine, such as Altavista, Infoseek or Lycos. If the query is accurate, then re-submitting the query at a later date will detect any new instances of the concept that have been added. We will present experimental results on this problem with two implemented systems. One is an interactive system, an augmented WWW browser that allows the user to label any document and to learn a search query from previously labeled examples. This system is useful in locating documents similar to those in a resource directory, thus making it more comprehensive. The other is a batch system which repeatedly learns queries from examples, and then collects and labels pages using these queries. In labeling examples, this system assumes that the original resource directory is complete, and hence can only be used with a nearly exhaustive initial resource directory; however, it can operate without human intervention.

Prior art related to machine learning methods includes the following:

U.S. Pat. No. 5,278,980 issued Jan. 11, 1994 discloses an information retrieval system and method in which an operator inputs one or more query words which are used to determine a search key for searching through a corpus of documents, and which returns any matches between the search key and the corpus of documents as a phrase containing the word data matching the query word(s), a non-stop (content) word next adjacent to the matching word data, and all intervening stop-words between the matching word data and the next adjacent non-stop word. The operator, after reviewing one or more of the returned phrases, can then use one or more of the next adjacent non-stop words as new query words to reformulate the search key and perform a subsequent search through the document corpus. This process can be conducted iteratively, until the appropriate documents of interest are located. The additional non-stop words for each phrase are preferably aligned with each other (e.g., columnation) to ease viewing of the “new” content words.

Other prior art related to machine learning methods is disclosed in the references attached to the specification as Appendix 1.

None of the prior art discloses a system and method of adding documents to a resource directory in a distributed information system by using a learning means to generate from training data a plurality of items as positive and/or negative examples of a particular class and using a learning means to generate at least one query that can be submitted to any of a plurality of methods for searching the system for a new item, after which the new item is evaluated by learning means with the aim of verifying that the new item is a new subset of the class.

SUMMARY OF THE INVENTION

An information retrieval system finds information in a Distributed Information System (DIS), e.g., the Internet, using query learning and meta search for adding documents to resource directories contained in the DIS. A selection means generates training data characterized as positive and negative examples of a particular class of data residing in the DIS. A learning means generates from the training data at least one query that can be submitted to any one of a plurality of search engines for searching the DIS to find “new” items of the particular class. An evaluation means determines and verifies that the new item(s) is a new subset of the particular class and adds or updates the particular class in the resource directory.

DESCRIPTION OF THE DRAWING

FIG. 1 is a representation of a prior art distributed information system which implements the principles of the present invention.

FIG. 2 is a listing of pseudo code for a batch query-learner incorporating the principles of the present invention in the system of FIG. 1.

FIG. 3 is a representation of an interactive query-learning system incorporating the principles of the present invention.

FIG. 4 is a user interface to the query learning system of the present invention.

FIG. 5 is a listing of pseudo code for an on-line prediction algorithm incorporating the principles of the present invention.

FIG. 6 is a Table summarizing experiments with the learning system of FIG. 3.

FIG. 7 is a Table summarizing experiments with the learning system of FIG. 2.

FIGS. 8A, 8B, 8C and 8D are graphs of data showing the results of precision-recall tradeoff for the three problems studied with the batch query learning system of FIG. 2.

FIG. 9 is a Table of results of a generalization error study for the learning systems of FIG. 2 and FIG. 5.

DESCRIPTION OF PREFERRED EMBODIMENTS

The problem addressed by the present invention is a variant of the problem of relevance feedback, which is well-studied in information retrieval. One novel aspect of the present invention (other than the WWW-based setting) is that we will focus, as much as is practical, on learning methods that are independent of the search engine used to answer a query. This emphasis seems natural in a WWW setting, as there are currently a number of general-purpose WWW search engines, all under constant development, none clearly superior to the others, and none directly supporting relevance feedback (at the time of this application); hence it seems inappropriate to rely too heavily on a single search engine. The current implementation can use any of several search engines. A second motivation for investigating search engine independent learning methods is that there are many search engines accessible from the WWW that index databases partially or entirely separate from the WWW. As WWW browsers and the Common Gateway Interface (CGI) now provide a nearly uniform interface to many search engines, it seems reasonable to consider the problem of designing general-purpose relevance feedback mechanisms that require few assumptions to be made about the search engine.

A distributed information system 10, e.g., the Internet, to which the invention is applicable is shown in FIG. 1. The Internet is further described in the text “How The Internet Works” by Joshua Eddings, published by Ziff Davis, 1994. The system includes a plurality of processors 12 and related databases 14 coupled together through routers (not shown) for directing messages among the processors in accordance with network protocols. Each processor and related database is coupled to a plurality of users through servers (not shown). The users may originate messages for purposes of communication with other users and/or search the system for information using search engines.

The initial research goal was to implement a WWW-based query-learning system in the system of FIG. 1 and support meaningful experimentation to provide a qualitative evaluation of the difficulty of the task.

To conduct this initial evaluation two different systems were implemented: one designed for batch use, and the other designed for interactive use, as will be described hereinafter.

A Batch System

The first implementation is a Perl script that runs as a “batch” system; it requires no user intervention. The input of the batch system is a list of Uniform Resource Locators (URLs) that correspond to the positive examples of an unknown concept. The batch system has two outputs: an intensional representation of the unknown concept, and a set of example documents that include all of the positive examples plus a sample of negative examples.

The procedure used to accomplish this is shown in FIG. 2. Three subroutines are used. The first, Learn, learns a concept from a sample. The only assumption made by the query-learning system about the learning system is that the hypothesis of the learning system is in disjunctive normal form (DNF), where the primitive conditions test for the presence of words. For example, a DNF hypothesis learned from a resource list on college basketball might be:

(college ∧ basketball) ∨ (college ∧ hoops) ∨ (NCAA ∧ basketball)

Henceforth we will call each term (conjunction) in this DNF a “rule”.

A set of k rules can be easily converted to k search queries, each of which consists of a conjunction of words, a query format that is supported by practically every search engine. The restriction, therefore, makes the system largely independent of the search engine used.

The second subroutine used by the query-learning system, CorrespondingQuery, converts a single rule to a query for the search engine being used. Some knowledge about the search engine is clearly needed to appropriately encode the query; however, because most search engines use similar formats for queries, adding the knowledge needed to support a new search engine is usually straightforward. Some search engines can handle more expressive queries: queries that require terms to appear near each other, or queries that contain word stems like “comput*”. Most advanced queries are not currently supported by the existing CorrespondingQuery routine. One exception is queries containing conditions that check for the absence (rather than the presence) of words, such as (basketball ∧ ¬NCAA). These can be used if both the learning system and the query system allow it, but were not used in any of the experiments of this invention.
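The rule-to-query conversion described above can be illustrated with a minimal Python sketch. The function name `rules_to_queries` and the `" AND "` connective are assumptions made for illustration only; the actual CorrespondingQuery routine is a Perl routine whose per-engine encodings are not reproduced here.

```python
from typing import FrozenSet, List

Rule = FrozenSet[str]  # one conjunction of words, i.e. one DNF term


def rules_to_queries(rules: List[Rule], connective: str = " AND ") -> List[str]:
    """Turn each of the k rules into one conjunctive query string.

    The connective is a placeholder: each real search engine has its own
    syntax for requiring that all words be present.
    """
    # Words are sorted only to make the output deterministic.
    return [connective.join(sorted(rule)) for rule in rules]


# The college basketball hypothesis from the text, as three rules.
dnf = [
    frozenset({"college", "basketball"}),
    frozenset({"college", "hoops"}),
    frozenset({"NCAA", "basketball"}),
]
queries = rules_to_queries(dnf)
```

Supporting another engine would, under this sketch, amount to passing a different connective or swapping in a different encoding function.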

The final subroutine, Top-k-Documents, submits a query to a search engine and collects the top k documents returned. Again, some knowledge about the search engine is needed to perform this task.

The basic procedure followed by the batch query-learner is to repeatedly learn a set of rules, convert these rules to queries, and then use incorrect responses to these queries as negative examples. The premise behind this approach is that the responses to learned queries will be more useful than randomly selected documents in determining the boundary of the concept. Although this simple method works reasonably well, and can be easily implemented with existing search engines, we suspect that other strategies for collecting examples may be competitive or superior; for instance, promising results have been obtained with “uncertainty sampling”, see Lewis and Gale (16), and query-learning by committee, see Seung et al. (25). Also see Dagan and Engelson (10).
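A minimal Python sketch of this learn, query, collect cycle follows. It is not the pseudo code of FIG. 2: the function and parameter names, the in-memory `CORPUS`, and the toy learner and toy search routine are all hypothetical stand-ins, and the sketch simply stops when no new documents are found rather than varying the learner's parameters.

```python
from typing import Callable, FrozenSet, List, Set, Tuple

Rule = FrozenSet[str]


def batch_query_learner(
    positives: Set[str],
    default_negatives: Set[str],
    learn: Callable[[Set[str], Set[str]], List[Rule]],
    top_k_documents: Callable[[Rule, int], List[str]],
    k: int = 20,
    max_negatives: int = 50,
    max_learner_calls: int = 5,
) -> Tuple[List[Rule], Set[str]]:
    # A default negative that is also labeled positive is ignored.
    negatives = set(default_negatives) - positives
    rules: List[Rule] = []
    for _ in range(max_learner_calls):
        rules = learn(positives, negatives)
        new_docs = set()
        for rule in rules:
            for url in top_k_documents(rule, k):
                # Anything returned that is not a known positive is
                # assumed to be an incorrect response.
                if url not in positives and url not in negatives:
                    new_docs.add(url)
        if not new_docs:
            break  # simplification: the text varies learner parameters here
        negatives |= new_docs
        if len(negatives) >= max_negatives:
            break
    return rules, negatives


# Hypothetical toy corpus: URL -> set of tokens.
CORPUS = {
    "a": {"machine", "learning", "course"},
    "b": {"machine", "learning", "course", "syllabus"},
    "c": {"machine", "shop"},
    "d": {"course", "golf"},
    "e": {"machine", "learning", "course", "notes"},
    "f": {"course", "machine", "repair"},
}


def toy_learn(positives: Set[str], negatives: Set[str]) -> List[Rule]:
    """Toy learner: add shared words until no known negative is covered."""
    common = set.intersection(*(CORPUS[u] for u in positives))
    rule: List[str] = []
    for word in sorted(common):
        covered = [u for u in negatives if set(rule) <= CORPUS[u]]
        if rule and not covered:
            break
        rule.append(word)
    return [frozenset(rule)]


def toy_top_k(rule: Rule, k: int) -> List[str]:
    """Toy search engine: documents containing every word of the rule."""
    return sorted(u for u, words in CORPUS.items() if rule <= words)[:k]


rules, negatives = batch_query_learner({"a", "b"}, {"c"}, toy_learn, toy_top_k)
```

In this toy run document "e" matches the concept but is labeled negative because it is not in the initial list, which illustrates why the batch procedure depends on a nearly complete starting directory.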

A few final details require some discussion.

Constraining the initial query: To construct the first query, a large set of documents was used as default negative examples. A “default negative example” is treated as an ordinary negative example unless it has already been labeled as a positive example, in which case the example is ignored. We used 363 documents collected from a cache used by our labs' HTTP proxy server as default negative examples.

Termination: In the current implementation, the process of learning rules and then collecting negative examples is repeated until some resource limit set by the user is exceeded. Currently the user can limit the number of negative examples collected, and the number of times the learning system is called.

Avoiding looping: It may be that on a particular iteration, no new documents are collected. If this occurs, then the training data on the next iteration will be the same as the training data on the previous iteration, and the system will loop. To avoid this problem, if no new documents are collected on a cycle, heuristics are used to vary the parameters of the learning system for the next cycle. In the current implementation, two heuristics are followed: if the hypothesis of the learning system is an empty rule set, then the cost of a false negative is raised; otherwise, the cost of a false positive is raised. The proper application of these heuristics, of course, depends on the learning system being used.

An Interactive System

The batch system assumes that every document not on the resource list is a negative example. This means that it cannot be successfully used unless one is confident that the initial set of documents is reasonably complete. Our experience so far is that this is seldom the case. For this reason, we also implemented an interactive query-learning system, which does not assume completeness of an initial set of positive examples; instead, it relies on the user to provide appropriate labels.

The interactive system does not force any particular fixed sequence for collecting documents and labeling; instead it is simply an augmented WWW browser, which allows the user to label the document being browsed, to invoke the learning system, or to conduct a search using previously learned rules.

The architecture of the interactive system is shown in FIG. 3. The user's interface to the query-learning system is implemented as a separate module that is interposed between a WWW browser and an HTTP proxy server. This module performs two main jobs. First, every HTML document that is transmitted from the proxy server to the browser is augmented, before being sent to the browser, by adding a small amount of text and a small number of special links at the beginning of the document. Second, while most HTTP requests generated by the browser are passed along unmodified to the proxy server, the HTTP requests that are generated by clicking on the special inserted links are trapped and treated specially.

This implementation has the advantage of being browser-independent. Following current practice, the system has been assigned an acronym: SWIMSUIT, for Surfing While Inducing Methods to Search for URLs. The user's view of the query-learning system is a set of special links that appear at the top of each HTML page. Clicking on these links allows the user to perform operations such as classifying a document or invoking the learning system.

Functionally, the special links inserted by the query-learning interface act as additional “control buttons” for the browser, similar to the buttons labeled “Back” and “Net Search” on the Netscape browser. By clicking on special links, the user can classify pages, invoke the learning system, and so on. The user's view of the interactive system is shown in FIG. 4.

The special links are:

Document labeling: The yes link and no link allow the user to classify the current page as a positive (respectively negative) example of the current class.

Invoking the learner: The learn link returns a form that allows the user to set options for the actual learning system and/or invoke the learner on the current class. The behavior of this link can be easily changed, so that different learning systems can be used in experiments. As in the batch system, learning is normally constrained by using default negative examples. This means that reasonable rules can often be found even if only a few positive examples are marked.

Searching: The search link returns a list of previously learned rules. Clicking on any rule will submit the corresponding query to the currently selected search engine, and return the result.

Configuration and help: The set options link returns a form that allows the user to change the current class (or to name a new class), or to change the current search engine; the review previous link returns an HTML page that lists all previously marked examples of the current class; and the help link returns a help page.
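The two jobs of the interposed module, augmenting pages and trapping clicks on the special links, might be sketched as follows in Python. The link names come from the list above; the `CONTROL_PREFIX` URL, the function names, and the way the current page is encoded in the link are purely hypothetical and are not taken from the SWIMSUIT implementation.

```python
from urllib.parse import quote

# Link names as listed in the text.
SPECIAL_LINKS = ["yes", "no", "learn", "search",
                 "set options", "review previous", "help"]

# Hypothetical reserved prefix that the module would recognize and trap.
CONTROL_PREFIX = "http://swimsuit.invalid/control/"


def augment_html(html: str, page_url: str) -> str:
    """Prepend the special links to a page before it reaches the browser."""
    encoded_page = quote(page_url, safe="")
    anchors = []
    for name in SPECIAL_LINKS:
        href = CONTROL_PREFIX + quote(name) + "?page=" + encoded_page
        anchors.append('<a href="' + href + '">' + name + "</a>")
    return "<p>" + " | ".join(anchors) + "</p>\n" + html


def is_special_request(url: str) -> bool:
    """Requests for special links are trapped; all others pass through."""
    return url.startswith(CONTROL_PREFIX)


page = augment_html("<html><body>Course notes</body></html>",
                    "http://example.org/")
```

Because the links are ordinary HTML anchors, a design along these lines would need no support from the browser itself, which is consistent with the browser independence noted above.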

Learning Systems

Two learning systems have been integrated with the system: RIPPER, a propositional rule learner that is related to FOIL, see Quinlan (21), and a rule-learning version of “sleeping experts”. Sleeping experts is a new prediction algorithm that combines ideas from algorithms used for online prediction, see Freund (11), with the infinite attribute model of Blum (3).

These algorithms have different strengths and weaknesses. RIPPER implicitly assumes that examples are i.i.d., which is not the case for samples collected via browsing or by the batch query-learning system. However, formal results suggest that sleeping experts will perform well even on data sets that are selected in a non-random manner. The sleeping experts algorithm is also largely incremental, which is potentially an advantage in this setting. On the other hand, sleeping experts uses a more restricted hypothesis space, and cannot learn large rules, whereas RIPPER can (at least in principle).

RIPPER

Briefly, RIPPER builds a set of rules by repeatedly adding rules to an empty ruleset until all positive examples are covered. Rules are formed by greedily adding conditions to the antecedent of a rule with an empty antecedent until no negative examples are covered. After a ruleset is constructed, an optimization postpass massages the ruleset so as to reduce its size and improve its fit to the training data. A combination of cross-validation and minimum-description length techniques is used to prevent overfitting. In previous experiments, RIPPER was shown to be comparable to C4.5rules, Quinlan (22), in terms of generalization accuracy, but much faster for large noisy datasets. For more detail, see Cohen (8).

The version of RIPPER used here was extended to handle “set-valued features”, as described in Cohen (9). In this implementation of RIPPER, the value of a feature can be a set of symbols, rather than (say) a number or a single symbol. The primitive conditions that are allowed for a set-valued feature F are of the form c ∈ F, where c is any constant value that appears as a value of F in the dataset. This leads to a natural way of representing documents: a document is represented by a single feature, the value of which is the set of all tokens appearing in the document. In the experiments, documents were tokenized by deleting e-mail addresses, HTML special characters, and HTML markup commands; converting punctuation to spaces; converting upper to lower case; removing words from a standard stoplist, see Lewis (17); and finally treating every remaining sequence of alphanumeric characters as a token. To keep performance from being degraded by very large documents, we only used tokens from the first 100 lines of a file. This also approximates the behavior of some search engines, which typically index only the initial section of a document.
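The tokenization steps listed above can be sketched in Python as follows. The regular expressions and the small `STOPLIST` are illustrative assumptions: the actual patterns are not given in the text, and the stoplist of Lewis (17) is not reproduced here. The sketch removes markup before e-mail addresses so that addresses inside tags disappear with the tag.

```python
import re
from typing import Set

# Illustrative stand-in only; not the standard stoplist cited in the text.
STOPLIST = {"the", "a", "of", "and", "to", "in", "is", "for"}


def tokenize(text: str, max_lines: int = 100,
             stoplist: Set[str] = STOPLIST) -> Set[str]:
    """Return the set of tokens representing a document."""
    # Only the first max_lines lines of the file are used.
    text = "\n".join(text.splitlines()[:max_lines])
    text = re.sub(r"<[^>]*>", " ", text)      # HTML markup commands
    text = re.sub(r"&#?\w+;", " ", text)      # HTML special characters
    text = re.sub(r"\S+@\S+", " ", text)      # e-mail addresses
    # Punctuation becomes spaces, then upper case is folded to lower case.
    text = re.sub(r"[^A-Za-z0-9]+", " ", text).lower()
    # Every remaining alphanumeric run that is not a stopword is a token.
    return {tok for tok in text.split() if tok not in stoplist}


sample = "<h1>Machine Learning</h1>\nContact: bob@example.com &amp; the staff"
tokens = tokenize(sample)
```

The returned set plays the role of the single set-valued feature F, so a primitive condition c ∈ F becomes a membership test such as `"learning" in tokens`.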

A second extension to RIPPER allows the user to specify a loss ratio, see Lewis and Catlett (14). A loss ratio indicates the ratio of the cost of a false negative error to the cost of a false positive error; the goal of learning is to minimize total misclassification cost, rather than simply the number of errors, on unseen data. Loss ratios in RIPPER are implemented by changing the weights given to false positive errors and false negative errors in the pruning and optimization stages of the learning algorithm.
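As a small illustration of the quantity being minimized, the following Python sketch computes total misclassification cost under a loss ratio. The function name is hypothetical, and fixing the cost of a false positive at 1 is an assumption made so that the loss ratio alone determines the cost of a false negative; it does not show how RIPPER applies the weights internally.

```python
def misclassification_cost(false_negatives: int, false_positives: int,
                           loss_ratio: float) -> float:
    """Total cost when a false negative costs loss_ratio times a false positive."""
    # Assumption: a false positive has unit cost.
    return loss_ratio * false_negatives + 1.0 * false_positives


# Two hypothetical rule sets with the same number of errors (5 each).
cost_a = misclassification_cost(false_negatives=4, false_positives=1, loss_ratio=4.0)
cost_b = misclassification_cost(false_negatives=1, false_positives=4, loss_ratio=4.0)
```

With a loss ratio of 4 the second rule set is preferred even though both make five errors, which is the sense in which cost rather than error count is minimized.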

One additional modification to RIPPER was also made specifically to improve performance on the query-learning task. The basic RIPPER algorithm is heavily biased toward producing simple, and hence general, conjunctions; for example, for RIPPER, when a conjunction of conditions is specific enough to cover no negative examples, no further conditions will be added. This bias appears to be inappropriate in learning queries, where the concepts to be learned are typically extremely specific. Thus, we added a postpass to RIPPER that adds to each rule all conditions that are true for every positive covered by the rule. Actually, the number of conditions added was limited to a constant k; in the experiments below, k=20. Without this restriction, a rule that covers a group of documents that are nearly identical could be nearly as long as the documents themselves; many search engines do not gracefully handle very long queries. We note that a similar scheme has been investigated in the context of the “small disjunct problem”, see Holte (14). The postpass implements a bias towards specific rules rather than general rules.
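The specializing postpass can be sketched in Python as below. The function name is hypothetical, and the text does not say which conditions are kept when more than k qualify; taking the first k in alphabetical order is an assumption made only to keep the sketch deterministic.

```python
from typing import FrozenSet, List, Set


def specialize_rule(rule: FrozenSet[str], positives: List[Set[str]],
                    k: int = 20) -> FrozenSet[str]:
    """Add up to k words that occur in every positive document the rule covers."""
    covered = [doc for doc in positives if rule <= doc]
    if not covered:
        return rule
    # Words true of every covered positive and not already in the rule.
    shared = set.intersection(*covered) - rule
    # Assumption: alphabetical order decides which k conditions are added.
    extra = sorted(shared)[:k]
    return frozenset(rule | set(extra))


docs = [
    {"course", "machine", "learning", "cmu"},
    {"course", "machine", "learning", "mit"},
    {"golf", "scores"},
]
specific = specialize_rule(frozenset({"course"}), docs)
```

Here the general rule (course) becomes (course ∧ learning ∧ machine), a more specific conjunction that is still short enough to submit as a query.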

Sleeping Experts

In recent years there has been a growing interest in online prediction algorithms. The vast majority of the prediction algorithms are given a pool of fixed “experts”, each of which is a simple, fixed classifier, and build a master algorithm, which combines the classifications of the experts in some manner. Typically, the master algorithm classifies an example by using a weighted combination of the predictions of the experts. Building a good master algorithm is thus a matter of finding an appropriate weight for each of the experts. Formal results show that by using a multiplicative weight update, see Littlestone (18), the master algorithm is able to maintain a set of weights such that the predictions of the master algorithm are almost as good as the best expert in the pool, even for a sequence of prediction problems that is chosen by an adversary.

The sleeping experts algorithm is a procedure of this type. It is based on two recent advances in multiplicative update algorithms. The first is a weight allocation algorithm called Hedge, due to Freund and Schapire, see Freund (11), which is applicable to a broad class of learning problems and loss functions. The second is the infinite attribute model of Blum (3). In this setting, there may be any number of experts, but only a few actually post predictions on any given example; the remainder are said to be “sleeping” on that example. A multiplicative update algorithm for the infinite attribute model (based on Winnow, Littlestone (19)) has also been implemented, see Blum (4).

Below we summarize the sleeping experts procedure, which combines the Hedge algorithm with the infinite attribute model to efficiently maintain an arbitrarily large pool of experts with an arbitrary loss function.

The Master Algorithm

Pseudo-code for the algorithm is shown in FIG. 5. The master algorithm maintains a pool, which is a set recording which experts have been active on any previous example, and a set of weights, denoted by $p$, for every expert in the pool. At all times, all weights in $p$ will be non-negative. However, the weights need not sum to one. At each time step $t$, the learner is given a new instance $X_t$ to classify; the master algorithm is then given a set $W_t$ of integer indices, which represent the experts that are active (i.e., not “sleeping”) on $X_t$. The prediction of expert $i$ on $X_t$ is denoted by $y_i^t$.

Based on the experts in $W_t$, the master algorithm must make a prediction for the class of $X_t$, and then update the pool and the weight set $p$.

To make a prediction, the master algorithm decides on a distribution $\tilde{p}^t$ over the active experts, which is determined by restricting the set of weights $p$ to the set of active experts $W_t$ and normalizing the weights. We denote the vector of normalized weights by $\tilde{p}^t$, where

$\tilde{p}_i^t = p_i^t \Big/ \sum_{j \in W_t} p_j^t .$

The prediction of the master algorithm is

$F_\beta\left( \sum_{i \in W_t} \tilde{p}_i^t \, y_i^t \right).$

We use $F_\beta(r) = \ln(1 - r + r\beta) / \left( \ln(1 - r + r\beta) + \ln((1 - r)\beta + r) \right)$, the function used by Vovk (26) for predicting binary sequences.

Each active expert $i \in W_t$ then suffers some “loss” $l_i^t$. In the implementation described here, this loss is 0 if the expert's prediction is correct and 1 otherwise.

Next, the master algorithm updates the weights of the active experts based on the losses. (The weight of the experts who are asleep remains the same, hence we implicitly set $\forall i \notin W_t : p_i^{t+1} = p_i^t$.) When an expert is first encountered its weight is initialized to 1. At each time step $t$, the master algorithm updates the weights of the active experts as follows:

$\forall i \in W_t : p_i^{t+1} = \frac{1}{Z} \, p_i^t \, U_\beta(l_i^t),$

where $Z$ is chosen such that

$\sum_{i \in W_t} p_i^t = \sum_{i \in W_t} p_i^{t+1}.$

The “update function” $U_\beta$ is any function satisfying $\beta^r \le U_\beta(r) \le 1 - (1 - \beta) r$, see Cesa-Bianchi et al. (5). In our implementation, we used the linear update, $U_\beta(r) = 1 - (1 - \beta) r$, which is simple to implement and avoids expensive exponentiations.

Briefly, if one defines the loss of the master algorithm to be the average loss with respect to the distribution $\{ \tilde{p}_i^t \mid i \in W_t \}$, the cumulative loss of the master algorithm over all $t$ can be bounded relative to the loss suffered by the best possible fixed weight vector. These bounds hold for any sequence of examples $(x_1, y_1), \ldots, (x_t, y_t)$; in particular, the bounds hold for sequences whose instances are not statistically independent.
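The prediction and update steps of the master algorithm can be sketched in Python as follows. This is an illustration under stated assumptions, not the pseudo code of FIG. 5: the class name `SleepingExperts`, the dictionary-based pool, and the expert identifiers are hypothetical, and the sketch assumes 0 < β < 1, the 0/1 loss, and the linear update described above.

```python
import math
from typing import Dict, Hashable


class SleepingExperts:
    def __init__(self, beta: float = 0.5) -> None:
        self.beta = beta                            # assumed to lie in (0, 1)
        self.weights: Dict[Hashable, float] = {}    # the pool and its weights

    def _f(self, r: float) -> float:
        """F_beta(r) as given in the text."""
        b = self.beta
        num = math.log(1 - r + r * b)
        return num / (num + math.log((1 - r) * b + r))

    def _wake(self, active: Dict[Hashable, int]) -> None:
        # An expert seen for the first time gets weight 1.
        for i in active:
            self.weights.setdefault(i, 1.0)

    def predict(self, active: Dict[Hashable, int]) -> float:
        """active maps each awake expert to its 0/1 prediction."""
        self._wake(active)
        total = sum(self.weights[i] for i in active)
        r = sum(self.weights[i] * y for i, y in active.items()) / total
        return self._f(r)

    def update(self, active: Dict[Hashable, int], label: int) -> None:
        self._wake(active)
        before = sum(self.weights[i] for i in active)
        for i, y in active.items():
            loss = 0.0 if y == label else 1.0
            self.weights[i] *= 1 - (1 - self.beta) * loss   # linear update
        after = sum(self.weights[i] for i in active)
        # Dividing by Z keeps the total weight of the active experts unchanged.
        for i in active:
            self.weights[i] *= before / after


master = SleepingExperts(beta=0.5)
active = {("ml", 1): 1, ("ml", 0): 0}   # two hypothetical mini-experts
first = master.predict(active)
master.update(active, label=1)
second = master.predict(active)
```

Sleeping experts are simply absent from `active`, so their weights are never touched, which matches the implicit rule that their weights remain the same.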

The Pool of Experts

It remains to describe the experts used for WWW page classification. In our experiments each expert corresponds to a phrase that appears in a document. That is, if $\omega_i$ is the $i$th token appearing in the document, each expert is of the form $\omega_{i_1} \omega_{i_2} \ldots \omega_{i_k}$ where $1 \le i_1 < i_2 < \ldots < i_{k-1} < i_k$ and $i_k - i_1 < n$. This is a generalization of the ngram model. Note that our goal is to classify WWW documents; hence each ngram expert is used to predict the classification of the document in which it appears, rather than the next token (word). For each ngram we construct two mini-experts, one which always predicts 0 (not in the class), and one that always predicts 1. The loss of each mini-expert is either 0 or 1 depending on the actual classification of the document.
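To make the index condition concrete, the following Python sketch enumerates every token sequence whose first and last positions are fewer than n apart, and then pairs each with its two mini-experts. The function names are hypothetical, and the sketch places no separate limit on the number of tokens in a phrase beyond what the window allows, which is an assumption.

```python
from itertools import combinations
from typing import List, Set, Tuple

Phrase = Tuple[str, ...]


def sparse_phrases(tokens: List[str], n: int) -> Set[Phrase]:
    """All phrases with increasing indices i_1 < ... < i_k and i_k - i_1 < n."""
    phrases: Set[Phrase] = set()
    for first in range(len(tokens)):
        # Later positions allowed in the same window: first+1 .. first+n-1.
        later = range(first + 1, min(first + n, len(tokens)))
        for extra in range(len(later) + 1):
            for rest in combinations(later, extra):
                phrases.add(tuple(tokens[i] for i in (first, *rest)))
    return phrases


def mini_experts(phrases: Set[Phrase]) -> Set[Tuple[Phrase, int]]:
    """Each phrase yields one expert that always predicts 0 and one that predicts 1."""
    return {(p, label) for p in phrases for label in (0, 1)}


doc = ["machine", "learning", "course"]
bigrams = sparse_phrases(doc, n=2)
window3 = sparse_phrases(doc, n=3)
experts = mini_experts(window3)
```

With n=2 only adjacent pairs and single tokens appear, while n=3 also admits the gapped phrase ("machine", "course"), which is what makes the scheme more general than contiguous ngrams.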

Extracting Rules From Experts

Finally, heuristics are used to construct rules based on the weights constructed by the sleeping experts algorithm. We constructed a rule for each expert that predicts 1 and that has a large weight. This is done by scanning the weights of the combined experts (each combined expert containing two mini-experts) and selecting those which have large weight. More formally, an expert $i$ is used to construct a rule if

$p_i^T \Big/ \sum_{j \in \mathrm{Pool}} p_j^T \ge w_{min},$

where $T$ is the number of training examples, and $w_{min}$ is a weight threshold for extracting experts. In practice, we have found that most of the weight is often concentrated on few experts, and hence the number of experts extracted is not too sensitive to particular choices of $w_{min}$. We used $w_{min} = 0.0625$ and set the learning rate $\beta$ to be 0.5 in the experiments described below.

Typically, the “heavy” experts correspond to phrases that frequently appear in documents labeled as positive examples; however, they may also appear in many of the negative labeled documents. We therefore examined the mini-experts of each extracted expert and selected those experts which are statistically correlated only with the positive examples. We define the average prediction $p_i$ of expert $i$, based on its two mini-experts $(i,0)$ and $(i,1)$, to be $p_i = F_\beta(p_{i,0} / (p_{i,0} + p_{i,1}))$. An expert is finally chosen to be used as a rule if its average prediction is larger than $p_{min}$. In the experiments we used $p_{min} = 0.95$ as the default value, and increased or decreased this threshold to encourage proportionally more or fewer positive predictions.

Finally, as was done with RIPPER, we add to each rule the list of all tokens that appear in all positive documents covered by a rule. We also remove all rules that have strictly fewer conditions than another rule in the set. The result is a rule set where each rule is of the form $w_{i_1} \wedge w_{i_2} \wedge \ldots \wedge w_{i_k}$.

Although the sleeping experts algorithm treats this as an ngram, we currently treat it simply as a conjunction of features: clearly, this is suboptimal for search engines which support proximity queries.

Experimental Results

We have evaluated the system with three resource directories.

ML courses is part of an excellent machine learning resource maintained by David Aha. This list contained (at the time the experiments were conducted) pointers to 15 on-line descriptions of courses.

AI societies is a WWW page jointly maintained by SIGART, IJCAI, and CSCSI. It contains pointers to nine AI societies.

Jogging strollers. This is a list of pointers to discussions of, evaluations of, and advertisements for jogging and racing strollers.

Our initial goal was to find resource directories that were exhaustive (or nearly so), containing virtually all positive examples of some narrow category. Our hope was that systematic experiments could then be carried out easily with the batch system. However, finding such a resource turned out to be much harder than we expected.

We began with the ML courses problem, which as a narrow section of a frequently-used resource we expected to be comprehensive; however, preliminary experiments showed that it was not. (The first query constructed by the batch system using RIPPER retrieved (from Altavista) 17 machine learning course descriptions in the first 20 documents; however, only 5 of these were from the original list.) For those interested in details, this query was

(course ∧ machine ∧ instructor ∧ learning)

Our next try at finding a comprehensive resource directory was the AI societies problem; this directory had the advantage (not shared by the ML courses directory) that it explicitly stated a goal of being complete. However, similar experiments showed it to be quite incomplete. We then made an effort to construct a comprehensive list with the jogging strollers problem. This effort was again unsuccessful, in spite of spending about two hours with existing browsers and search engines on a topic deliberately chosen to be rather narrow.

We thus adopted the following strategy. With each problem we began by using the interactive system to expand an initial resource list. After the list was expanded, we invoked the batch system to collect additional negative examples and thus improve the learned rules.

Experiments With The Interactive system

We used the interactive system primarily to emulate the batch system;the difference, of course, being that positive and negative labels wereassigned to new documents by hand, rather than assuming all documentsnot in the original directory are negative. In particular, we did notattempt to uncover any more documents by browsing, or hand-constructedsearches. However, we occasionally departed from the script by varyingthe parameters of the learning system (in particular, the loss ratio),changing search engines, or examining varying numbers of documentsreturned by the search engines. We repeated the cycle of learning,searching, and labeling the results, until we were fairly sure that nonew positive examples would be discovered by this procedure.

FIG. 6 summarizes our usage of the interactive system. We show the number of entries in each initial directory, the recall of the initial directory relative to the final list that was generated, as well as the number of times a learner was invoked, the number of searches conducted, and the total number of pages labeled. (The term recall is the fraction of the time that an actual positive example is predicted to be positive by the classifier, and the term precision is the fraction of the time that an example predicted to be positive is actually positive. For convenience, we will define the precision of a classifier that always prefers the class negative as 1.00.) We count submitting a query for each rule as a single search, and do not count the time required to label the initial positive examples. Also, we typically did not attempt to label every negative example encountered in the search.
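By way of illustration only, the recall and precision measures defined above can be sketched in Python as follows. The function name and the sample labels are hypothetical and are not taken from the disclosed source code; returning a recall of 0.0 when there are no actual positives is likewise an assumption made for the sketch.

```python
def precision_recall(predicted, actual):
    """Return (precision, recall) for two parallel lists of boolean labels."""
    true_pos = sum(1 for p, a in zip(predicted, actual) if p and a)
    pred_pos = sum(1 for p in predicted if p)
    actual_pos = sum(1 for a in actual if a)
    # Convention stated in the text: a classifier that never predicts
    # the positive class is assigned a precision of 1.00.
    precision = true_pos / pred_pos if pred_pos else 1.0
    # Assumption for this sketch: recall is reported as 0.0 when the
    # evaluation set contains no actual positive examples.
    recall = true_pos / actual_pos if actual_pos else 0.0
    return precision, recall


# Hypothetical example: an initial directory listing 3 of 12 relevant pages.
# Treating "is in the initial directory" as the prediction, every listed page
# is relevant, so precision is perfect while recall is low.
in_directory = [True] * 3 + [False] * 9
relevant = [True] * 12
print(precision_recall(in_directory, relevant))
```

Viewed this way, a hand-built directory behaves like a classifier with perfect precision and limited recall, which is the comparison drawn in the experiments.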

To summarize, the interactive system appears to be very useful in the task of locating additional relevant documents from a specific class; in each case the number of known relevant documents was at least quadrupled. The effort involved was modest: our use of the interactive system generally involved labeling a few dozen pages, waiting for the results of a handful of searches, and invoking the learner a handful of times. In these experiments the time required by the learner is typically well under 30 seconds on a Sun 20/60.

Experiments With The Batch System

In the next round of experiments, we invoked the batch system for each of these problems. FIG. 7 shows the resource limit set for each of these problems (the column “#Iterations Allowed” indicates how many times the learning system could be called), the number of documents k that were collected for each query, and the total number of documents collected by the batch system (not including the initial set of 363 default negative examples). The resource limits used do not reflect any systematic attempt to find optimal limits. However, for the last two problems, the learner seemed to “converge” after a few iterations, and output a single hypothesis (or in one case alternated between two variants of a hypothesis) on all subsequent iterations. In each case, RIPPER was used as the learning system.

We then carried out a number of other experiments using the datasets collected by the batch system. One goal was simply to measure how successful the learning systems are in constructing an accurate intensional definition of the resource directories. To do this we re-ran the learning systems on the datasets constructed by the batch system, executed the corresponding queries, and recorded the recall and precision of these queries relative to the resource directory used in training. To obtain an idea of the tradeoffs that are possible, we varied the number of documents k retrieved from a query and parameters of the learning systems (for RIPPER, the loss ratio, and for sleeping experts, the threshold ρ_min). Altavista was used as the search engine.

The results of this experiment are shown in the graphs of FIG. 8. The first three graphs show the results for the individual classes and the final graph shows the results for all three classes together. Generally, sleeping experts generates the best high-precision classifiers. However, its rulesets are almost always larger than those produced by RIPPER; occasionally they are much larger. This makes them more expensive to use in searching and is the primary reason that RIPPER was used in the experiments with the batch and interactive systems.

The constructed rulesets are far from perfect, but this is to be expected. One difficulty is that neither of the learners perfectly fits the training data; another is that the search engine itself is incomplete. However, it seems quite likely that even this level of performance is enough to be useful. It is instructive to compare these hypotheses to the original resource directories that were used as input for the interactive system. The original directories all have perfect precision, but relatively poor recall. For the jogging strollers problem, both the learners are able to obtain nearly twice the recall (48% vs 25%) at 91% precision. For the AI societies problem, both learners obtain more than three times the recall at 94% precision or better. (RIPPER obtains 57% vs 15% recall with 94% precision.)

We also conducted a generalization error experiment on the datasets. In each trial, a random 80% of the dataset was used for training and the remainder for testing. A total of 50 trials were run for each dataset, and the average error rate, precision and recall on the test set (using the default parameters of the learners) were recorded.

The results are shown in FIG. 9. However, since the original sample is non-random, these numbers should be interpreted with great caution. Although the results suggest that significant generalization is taking place, they do not demonstrate that the learned queries can fulfill their true goal of facilitating maintenance by alerting the maintainer to new examples of a concept. This would require a study spanning a reasonable period of time.
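The trial procedure described above (a random 80% of the dataset for training, the remainder for testing, averaged over 50 trials) may be sketched as follows. This is a minimal Python illustration under stated assumptions, not the code used in the experiments: the names `run_trials` and `learn`, the fixed random seed, and the handling of empty denominators are all hypothetical choices made for the sketch.

```python
import random


def run_trials(dataset, learn, n_trials=50, train_frac=0.8, seed=0):
    """Average error rate, precision and recall over random train/test splits.

    dataset: list of (document, label) pairs with boolean labels.
    learn:   hypothetical callable taking the training pairs and returning
             a predict(document) -> bool function.
    """
    rng = random.Random(seed)  # fixed seed is an assumption, for repeatability
    totals = {"error": 0.0, "precision": 0.0, "recall": 0.0}
    for _ in range(n_trials):
        shuffled = dataset[:]
        rng.shuffle(shuffled)
        cut = int(train_frac * len(shuffled))
        train, test = shuffled[:cut], shuffled[cut:]
        predict = learn(train)
        preds = [predict(doc) for doc, _ in test]
        labels = [label for _, label in test]
        true_pos = sum(1 for p, y in zip(preds, labels) if p and y)
        pred_pos = sum(1 for p in preds if p)
        actual_pos = sum(1 for y in labels if y)
        totals["error"] += sum(1 for p, y in zip(preds, labels) if p != y) / len(test)
        # Precision of a never-positive classifier is taken as 1.00 (see text);
        # recall with no actual positives is taken as 0.0 (an assumption).
        totals["precision"] += true_pos / pred_pos if pred_pos else 1.0
        totals["recall"] += true_pos / actual_pos if actual_pos else 0.0
    return {name: total / n_trials for name, total in totals.items()}


# Hypothetical usage with a stand-in learner that ignores its training data
# and labels a document positive when it contains the word "course".
docs = [({"course", "learning"}, True)] * 5 + [({"stroller"}, False)] * 5
print(run_trials(docs, lambda train: (lambda words: "course" in words)))
```

Because each split is drawn from the same non-random sample, averages produced this way are subject to the caution expressed above.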

Summary

The World Wide Web (WWW) is currently filled with resource directories, that is, documents that collect together links to all known documents on a specific topic. Keeping resource directories up-to-date is difficult because of the rapid growth in on-line documents. This invention describes the use of machine learning methods as an aid in maintaining resource directories. A resource directory is treated as an exhaustive list of all positive examples of an unknown concept, thus yielding an extensional definition of the concept. Machine learning methods can then be used to construct from these examples an intensional definition of the concept. The learned definition is in DNF form, where the primitive conditions test the presence (or even the absence) of particular words. This representation can be easily converted to a series of queries that can be used to search for the original documents, as well as new, similar documents that have been added recently to the WWW.
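The conversion of a DNF definition into a series of queries can be illustrated with the following minimal Python sketch. The data layout (a list of conjunctions, each a list of word and presence-flag pairs) and the "+word" / "-word" query notation are assumptions made for illustration; the query syntax actually accepted varies from one search engine to another.

```python
def rule_to_queries(ruleset):
    """Turn a DNF ruleset into one query string per conjunction.

    ruleset: list of conjunctions; each conjunction is a list of
             (word, must_be_present) pairs.
    The "+word" / "-word" notation is a hypothetical query syntax.
    """
    queries = []
    for conjunction in ruleset:
        terms = [("+" if present else "-") + word for word, present in conjunction]
        queries.append(" ".join(terms))
    return queries


def matches(ruleset, document_words):
    """A document satisfies the DNF ruleset if any one conjunction holds."""
    return any(
        all((word in document_words) == present for word, present in conjunction)
        for conjunction in ruleset
    )


# The single-conjunction query reported for the ML courses problem.
ml_courses = [[("course", True), ("machine", True),
               ("instructor", True), ("learning", True)]]
print(rule_to_queries(ml_courses))
```

Issuing one query per conjunction and taking the union of the results corresponds to the disjunction in the DNF definition.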

Two systems were implemented to test these ideas, both of which make minimal assumptions about the search engine. One is a batch system which repeatedly learns a concept, generates an appropriate set of search queries, and uses the queries to collect more negative examples. An advantage of this procedure is that it can collect hundreds of examples with no human intervention; however, it can only be used if the initial resource list is complete (or nearly so). The second is an interactive system. This system augments an arbitrary WWW browser with the ability to label WWW documents and then learn search-engine queries from the labeled documents. It can be used to perform the same sorts of sequences of actions as the batch system, but is far more flexible. In particular, keeping a human user “in the loop” means that positive examples not on the original resource list can be detected. These examples can be added to the resource list, both extending the list and improving the quality of the dataset used for learning. In experiments, these systems produce usefully accurate intensional descriptions of concepts. In two of three test problems, the concepts produced had substantially higher recall than manually-constructed lists, while attaining precision of greater than 90%.
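The batch procedure summarized above (learn a concept, generate queries, and treat newly retrieved documents as additional negative examples) might be sketched as follows. This is an illustrative Python outline only; the callables `learn`, `to_queries` and `search` are hypothetical stand-ins for the learning system, the query generator and the search engine, and stopping early when no unseen document is retrieved is an assumption of the sketch rather than a feature taken from the disclosed code.

```python
def batch_collect(positives, negatives, learn, to_queries, search,
                  max_iterations, k):
    """Repeatedly learn, query, and add unseen retrieved documents as negatives.

    positives/negatives: iterables of document identifiers.
    learn(pos, neg)      -> ruleset          (hypothetical learner)
    to_queries(ruleset)  -> list of queries  (hypothetical query generator)
    search(query, k)     -> up to k document identifiers (hypothetical engine)
    """
    positives = set(positives)
    negatives = set(negatives)
    ruleset = None
    for _ in range(max_iterations):
        ruleset = learn(positives, negatives)
        retrieved = set()
        for query in to_queries(ruleset):
            retrieved.update(search(query, k))
        # Batch assumption from the text: the resource list is complete, so
        # any retrieved document that is not on it is labeled negative.
        new_negatives = retrieved - positives - negatives
        if not new_negatives:
            break  # assumption: stop once a pass yields nothing new
        negatives |= new_negatives
    return ruleset, negatives


# Hypothetical usage with stub components.
index = {"q": ["a", "b", "c", "d"]}
rule, negs = batch_collect(
    {"a"}, {"z"},
    learn=lambda pos, neg: "rule",
    to_queries=lambda ruleset: ["q"],
    search=lambda query, k: index[query][:k],
    max_iterations=5, k=3,
)
print(rule, sorted(negs))
```

In the interactive system the labeling step inside the loop is performed by a user, so that retrieved documents may be added to either the positive or the negative set.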

In support of the invention, and in particular the description of the Preferred Embodiment, the following Appendices are included in the application:

Appendix 1. A list of references cited in the application by reference numeral.

Appendix 2. A copy of a README file which describes the source code implementing the presently-preferred embodiment of the invention.

Appendix 3. Commented source code written in Perl for the presently-preferred embodiment of the invention.

This is a computer program listing appendix, submitted on a CD and incorporated by reference in its entirety.

Appendix 4. A copy of the documentation for the OreO shell tool which was used in the implementation of the presently-preferred embodiment.

While the invention has been shown and described with respect to preferred embodiments, various modifications can be made therein without departing from the spirit and scope of the invention, as described in the specification and defined in the claims, as follows:

We claim:
 1. A method of adding new documents to a resource list of existing documents, executable in a computer system, comprising the steps of: learning a rule for which the documents on the resource list are positive examples of a class of selection information which selects the documents on the resource list; making a persistent association between the selection information and the resource list; using the selection information independent for a meta search engine to identify data on a plurality of items characterized as positive and/or negative examples of the class of information to select a set of documents which the information specifies; and adding new documents to the resource list, the new documents being added belonging to a subset of the selected set of documents which contains documents which are not already on the resource list.
 2. The method set forth in claim 1 wherein the step of adding documents comprises the steps of: interactively determining whether a document in the subset should be added to the resource list; and adding the document only if it has been determined that the document should be added.
 3. The method set forth in claim 2 further comprising the steps of: using a document for which it has been determined that the document should not be added together with documents on the resource list to learn new selection information; and associating the new selection information with the resource list.
 4. The method set forth in claim 1 wherein the step of learning the selection information comprises the steps of: learning a rule for which the documents on the resource list are positive examples; translating the rule into a query; and in the step of using the selection information, using the query to select the set of documents.
 5. The method set forth in any of claims 1 through 4 wherein: the system in which the method is practiced has access to a plurality of searching means; the step of learning the selection information learns a plurality of queries as required by the plurality of searching means; and the step of using the selection information to select a set of documents uses the plurality of queries in the plurality of searching means.
 6. The method set forth in claim 5 wherein: the system in which the method is practiced has access to the world wide web; and the searching means are searching means in the world wide web.
 7. An improved web page of a type which contains a list of documents, the improvement comprising: a machine learning system for learning a rule for which the documents on the web page are positive examples of a class of selection information which selects the documents on the web page; a meta search engine using selection information to identify data on a plurality of items characterized as positive and/or negative examples of a class of information associated with the web page which selects documents having content which is similar to the documents on the list, whereby the list of documents on the web page is updated using the selection information.
 8. In a computer system, apparatus for making a resource list of documents which have contents belonging to the same class, the apparatus comprising: a first list of documents, all of which have contents as positive examples belonging to the class; a second list of documents, none of which have contents as negative examples belonging to the class; learning means responsive to the first list of documents and the second list of documents for learning a rule for which the documents on the resource list are positive examples of the class of selection information which specifies documents whose contents belong to the class; meta search means responsive to the selection information for finding the documents whose contents belong to the class, using the documents to make the resource list, and making a persistent association between the selection information and the resource list.
 9. The apparatus set forth in claim 8 further comprising: first interactive means for indicating whether a given document is to be added to the first list or the second list.
 10. The apparatus set forth in claim 9 further comprising: second interactive means for activating the learning means.
 11. The apparatus set forth in claim 10 further comprising: third interactive means for activating the means for finding the documents.
 12. The apparatus set forth in any of claims 9 through 11 wherein: the apparatus is used in a system which includes a document browser; and the interactive means of the claim are implemented in the document browser.
 13. In an information system which stores related data and information as items for a plurality of interconnected computers accessible by a plurality of users, a method for finding items of a particular class residing in the information system comprising the steps of: a) identifying as training data a plurality of items characterized as positive and/or negative examples of the class; b) using a learning technique to generate from the training data at least one query that can be submitted to any of a plurality of methods for searching the information system; c) submitting said query to meta search means and collecting any new item(s) as a response to the query; d) evaluating the new item(s) by a learned model with the aim of verifying that the new item(s) is indeed a new subset of the particular class; and e) presenting the new subset of the new item(s) to a user of the system.
 14. The method of claim 13 wherein the information system is a distributed information system (DIS) and the items are documents collected in resource directories in the DIS.
 15. The method of claim 14 wherein step a) the positive examples are a set of documents in the resource directories and the negative examples are a selection of documents obtained by using the process of steps a-d.
 16. The method of claim 15 wherein step b) the query is (i) a conjunction of terms which must appear in a document as a positive example; (ii) contains all the terms appearing in the training data covered by the query, and (iii) learned by the system using a propositional rule-learning or prediction algorithm method.
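 17. The method of claim 16 wherein step d) a learning technique generates from the training data a learned model that computes a score for the new item(s), such that the new item(s) which has a high probability of being classified within the particular class will be assigned a higher score than the new item(s) which has a low probability of being classified within the particular class.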
 18. The method of claim 17 further comprising the step of providing a user on the system an ordered list of the new item(s) according to the score assigned by the learned model.
 19. The method of claim 17 further comprising the step of providing a user by electronic mail or facsimile an ordered list of the new item(s) having a score exceeding a threshold probability.
 20. The method of claim 17 further comprising the step of using a batch process to identify documents as positive or negative examples of the search concept.
 21. The method of claim 17 further comprising the step of using an interactive process to identify documents as positive examples of the search concept by browsing the distributed information system.
 22. The method of claim 17 further comprising the step of resubmitting a query to the system to detect any new item added to the system and related to the query.
 23. An information system which stores related data and information as items for a plurality of interconnected computers accessible by a plurality of users for finding items of a particular class residing in the information system using query learning and meta search, comprising: a) means for identifying as training data in the system a plurality of items characterized as positive and/or negative examples of the class; b) means for using a learning technique to generate from the training data at least one query that can be submitted to any of a plurality of search engines for searching the information system; c) means for submitting said query to a meta search engine and collecting any new item(s) as a response to the query; d) means for evaluating the new item(s) by the at least one search engine with the aim of verifying that the new item(s) is indeed a new subset of the particular class; and e) means for presenting the new subset of the new item(s) to a user of the system.
 24. The system of claim 23 wherein the information system is a distributed information system (DIS) and the items are documents stored in resource directories in the DIS.
 25. The system of claim 24 wherein the positive examples are a set of items in the resource directories and the negative examples are a selection of documents obtained by the search engine in responding to the query.
 26. The system of claim 25 wherein the query is (i) a conjunction of terms which must appear in a document as a positive example; (ii) contains all the terms appearing in the training data covered by the query, and (iii) learned by the system using a propositional rule-learning or prediction algorithm method.
 27. The system of claim 26 wherein the learning technique generates from the training data a learned model that computes a score for the new item(s), such that the new item(s) which has a high probability of being classified within the particular class will be assigned a higher score than the new item(s) which has a low probability of being classified within the particular class.
 28. The system of claim 27 further comprising means for providing a user on the system an ordered list of the new item(s) according to the score assigned by the learned model.
 29. The system of claim 27 further comprising means for providing a user by electronic mail or facsimile an ordered list of the new item(s) having a score exceeding a threshold probability.
 30. The system of claim 27 further comprising means for using a batch process to select documents as positive examples of the search concept.
 31. The system of claim 27 further comprising means for using an interactive process to identify documents as positive examples of the query by browsing the distributed information system.
 32. The system of claim 27 further comprising means for resubmitting a query to the system to detect any new item added to the system and related to the query.
 33. An article of manufacture comprising: a computer useable medium having computer readable program code means embodied therein for finding items of a particular class residing in an information system which stores related data and information as items for a plurality of interconnected computers accessible by a plurality of users, the computer readable program code means in said article of manufacture comprising: a) program code means for identifying as training data a plurality of items characterized as positive and/or negative examples of the class; b) program code means for using a learning technique to generate from the training data at least one query that can be submitted to any of a plurality of methods for searching the information system; c) program code means for submitting said query to meta search means and collecting any new item(s) as a response to the query; d) program code means for evaluating the new item(s) by meta search means with the aim of verifying that the new item(s) is indeed a new subset of the particular class; and e) program code means for presenting the new subset of new item(s) to a user of the system.